Connection weight learning for guided architecture evolution

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for determining one or more neural network architectures of a neural network for performing a video processing neural network task. In one aspect, a method comprises: at each of a plurality of iterations: selecting a parent neural network architecture from a set of neural network architectures; training a neural network having the parent neural network architecture to perform the video processing neural network task, comprising determining trained values of connection weight parameters of the parent neural network architecture; generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture; and adding the new neural network architecture to the set of neural network architectures.

CROSS-REFERENCE TO RELATED APPLICATION

This application is an International Application and claims the benefit of U.S. Application No. 62/852,217, filed May 23, 2019. The disclosure of the foregoing application is hereby incorporated by reference in its entirety.

BACKGROUND

This specification relates to processing data using machine learning models.

Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.

Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations for determining a neural network architecture for performing a machine learning task.

According to a first aspect there is provided a method performed by one or more data processing apparatus for determining a neural network architecture of a neural network for performing a video processing neural network task, the method comprising: maintaining data defining a set of neural network architectures, wherein for each neural network architecture: the neural network architecture comprises a plurality of blocks, wherein each block is a space-time convolutional block comprising one or more neural network layers that is configured to process a block input to generate a block output; and for each of one or more given blocks: (i) the block input for the given block comprises a block output from each of a plurality of other blocks, (ii) the given block has a respective connection weight parameter corresponding to each of the plurality of other blocks, and (iii) processing the block input comprises combining the other block outputs using the connection weight parameters corresponding to the other blocks; at each of a plurality of iterations: selecting a parent neural network architecture from the set of neural network architectures; training a neural network having the parent neural network architecture to perform the video processing neural network task, comprising determining trained values of the connection weight parameters of the parent neural network architecture; generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture; and adding the new neural network architecture to the set of neural network architectures; and after a final iteration of the plurality of iterations, selecting a final neural network architecture from the set of neural network architectures based on a performance metric for the final neural network architecture on the video processing neural network task.

In some implementations, each neural network architecture is configured to process an input comprising (i) a plurality of video frames, and/or (ii) a plurality of optical flow frames corresponding to the plurality of video frames.

In some implementations, each block processes a block input at a respective temporal resolution to generate a block output having a respective number of channels.

In some implementations, each block comprises one or more dilated temporal convolutional layers having a temporal dilation rate corresponding to the temporal resolution of the block.

In some implementations, each neural network architecture comprises blocks having different temporal resolutions.

In some implementations, each block comprises one or more residual modules.

In some implementations, combining the other block outputs using the connection weight parameters corresponding to the other blocks comprises: for each other block output, scaling the other block output by the connection weight parameter corresponding to the other block; and generating a combined input by summing the scaled other block outputs.

In some implementations, processing the block input further comprises processing the combined input in accordance with a plurality of block parameters to generate the block output.

In some implementations, generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture comprises determining which blocks in the new neural network architecture should receive block outputs from which other blocks in the new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture.

In some implementations, determining which blocks in the new neural network architecture should receive block outputs from which other blocks in the new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture comprises determining, for each given block in the parent neural network architecture that (i) receives a block output from an other block and (ii) has a connection weight parameter corresponding to the other block having a trained value that exceeds a threshold, that a block in the new neural network architecture corresponding to the given block should receive a block output from a block in the new neural network architecture corresponding to the other block.

In some implementations, the threshold is a predetermined threshold.

In some implementations, the threshold is sampled in accordance with a predetermined probability distribution.

In some implementations, the method further comprises, for each of one or more pairs of blocks in the new neural network architecture comprising a first block and a second block, randomly determining whether the second block should receive a block output from the first block.

In some implementations, generating the new neural network architecture comprises, for each of one or more blocks in the new neural network architecture that correspond to respective blocks in the parent neural network architecture, applying one or more mutation operations to the block, wherein the mutation operations comprise: splitting the block, merging the block with a different block, and adjusting a temporal resolution of the block.

In some implementations, selecting a parent neural network architecture from the set of neural network architectures comprises: determining, for each of a plurality of particular neural network architectures from the set of neural network architectures, a performance metric of a neural network having the particular neural network architecture that is trained to perform the video processing neural network task; and selecting the parent neural network architecture from among the plurality of particular neural network architectures based on the performance metrics.

In some implementations, selecting the parent neural network architecture from among the plurality of particular neural network architectures based on the performance metrics comprises selecting the parent neural network architecture as the particular neural network architecture having the highest performance metric on the video processing neural network task.

In some implementations, the method further comprises removing the particular neural network architecture having the lowest performance measure on the video processing neural network task from the set of neural network architectures.

In some implementations, selecting the final neural network architecture from the set of neural network architectures comprises: determining, for each neural network architecture from the set of neural network architectures, a performance metric of a neural network having the neural network architecture that is trained to perform the video processing neural network task; and selecting the final neural network architecture as the neural network architecture having the highest performance metric on the video processing neural network task.

In some implementations, the method further comprises providing a neural network that (i) has the final neural network architecture and (ii) has been trained to perform the video processing neural network task.

In some implementations, for each neural network architecture: each block in the neural network architecture is associated with a respective level in a sequence of levels; and for each given block that is associated with a given level that follows the first level in the sequence of levels, the given block only receives block outputs from other blocks that are associated with levels that precede the given level.

According to another aspect, there is provided a system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform the operations of the previously described method.

According to another aspect, there are provided one or more computer storage media (e.g., non-transitory computer storage media) storing instructions that when executed by one or more computers cause the one or more computers to perform the operations of the previously described method.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The system described in this specification can automatically select a neural network architecture that may enable a neural network having the architecture to effectively perform a machine learning task, e.g., a video processing task. The system may select the architecture from a space of possible architectures that each include multiple blocks of neural network layers (e.g., space-time convolutional layers) that process block inputs which may be derived from different input streams (e.g., video frame streams or optical flow streams) at respective temporal resolutions. As part of selecting the architecture, the system selects how the blocks in the architecture connect to one another, i.e., which blocks receive inputs from which other blocks, and can thereby control how data flows through the architecture, and when and where features encoding different information at various levels of abstraction are combined. In contrast, some conventional architecture selection systems select an architecture from a less flexible space of possible architectures, e.g., a space where each architecture includes the same module of neural network layers sequentially repeated multiple times. By searching a complex and flexible space of possible architectures, the system described in this specification may select an architecture that enables a neural network having the architecture to perform machine learning tasks more effectively, e.g., with a higher prediction accuracy.

To select the architecture, the system evolves (i.e., updates) a population (i.e., set) of neural network architectures over multiple evolutionary iterations. At each evolutionary iteration, the system can generate a new neural network architecture from an existing, “parent” neural network architecture based on trained values of connection weight parameters that define the strength of connections between blocks in the parent neural network architecture. For example, the system can determine that strongly connected blocks in the parent architecture should still be connected in the new architecture, while connections between other blocks may be randomly re-configured in the new architecture. Guiding the evolution of the population of neural network architectures based on trained values of connection weight parameters may enable the system to select an architecture that achieves an acceptable performance on a machine learning task over fewer evolutionary iterations, thereby reducing consumption of computational resources, e.g., memory and computing power.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a possible neural network architecture for performing a video processing task that is generated by an architecture selection system.

FIG. 2 shows an example architecture selection system.

FIG. 3 is a flow diagram of an example process for determining a neural network architecture of a neural network for performing a video processing neural network task.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes an architecture selection system 200 for selecting an architecture for a neural network that is configured to perform a machine learning task.

The machine learning task may be a video processing task, i.e., where the neural network processes a video (or features derived from the video) to generate an output that characterizes the video. A “video” may refer to a sequence of video frames, where each video frame may be represented as an array of numerical values and corresponds to a respective time point. For example, a video frame may be represented as an array of RGB or CIELAB values.

The input to the neural network may include, e.g., a sequence of video frames, a sequence of optical flow frames corresponding to the video frames, or both. Each optical flow frame may characterize the motion between a respective pair of video frames, i.e., between a first video frame and a subsequent video frame. For example, each optical flow frame may specify a respective displacement vector corresponding to each pixel in a first video frame, where the displacement vector estimates a displacement of the pixel between the first video frame and a subsequent video frame. The optical flow frames may be derived from the video frames using any of a variety of techniques, e.g., the Lucas-Kanade method.
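
By way of illustration only, per-pixel displacement vectors may be precomputed from consecutive frames as in the sketch below, which uses OpenCV's dense Farneback estimator; the choice of estimator (Farneback here, rather than the Lucas-Kanade method named above), the grayscale conversion, and the parameter values are assumptions of this sketch and are not details fixed by this specification.

```python
import cv2

def optical_flow_frames(video_frames):
    """Compute one dense optical flow frame per consecutive pair of video frames.

    video_frames: list of HxWx3 uint8 RGB arrays.
    Returns a list of HxWx2 float32 arrays of per-pixel (dx, dy) displacements.
    """
    gray = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in video_frames]
    flows = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        # Dense flow: one displacement vector per pixel of the first frame.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    return flows
```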

The neural network may generate any of a variety of outputs that characterize an input video. For example, the neural network may generate a classification output that includes a respective score for each of multiple classes, where the score for a class defines a likelihood that the video is included in the class. In a particular example, the neural network may generate a classification output that classifies an action being performed by a person depicted in the video, and the classification output may specify a respective score for each action in a set of possible actions.

The architecture selection system 200, which will be described in more detail with reference to FIG. 2, selects an architecture for the neural network from a space of possible architectures. An example architecture 100 that was selected by the architecture selection system 200 for a neural network configured to perform video processing tasks is illustrated in FIG. 1. Each possible architecture includes multiple “blocks,” including one or more input blocks and one or more intermediate blocks. As used throughout this specification, a block refers to a group of neural network layers, e.g., that may be arranged in a sequence.

Each input block (e.g., the input block 102) may be configured to process the sequence of video frames (e.g., the video frames 104), the sequence of optical flow frames (e.g., the optical flow frames 106), or both. Each intermediate block may be configured to process an input that includes the outputs of one or more other blocks (e.g., the block 108 may be configured to receive inputs from the blocks 102 and 110). The neural network may generate an output (e.g., the output 112) by processing the outputs of one or more blocks of the neural network (e.g., the block 114) by one or more neural network layers, e.g., a sequence of layers including a pooling layer, a fully-connected layer, and a soft-max layer.

Each block may be a space-time convolutional block, i.e., a block that includes one or more convolutional neural network layers and that is configured to process a space-time input to generate a space-time output. Space-time data refers to an ordered collection of numerical values, e.g., a tensor of numerical values, which includes multiple spatial dimensions, a temporal dimension, and optionally, a channel dimension. Each block may generate an output having a respective number of channels, and in the neural network illustrated in FIG. 1, the number of channels in the output of each block is denoted by C, e.g., where the block 102 generates a block output with C=32 channels.

Each block may include, e.g., spatial convolutional layers (i.e., having convolutional kernels that are defined in the spatial dimensions), space-time convolutional layers (i.e., having convolutional kernels that are defined across the spatial and temporal dimensions), and temporal convolutional layers (i.e., having convolutional kernels that are defined in the temporal dimension).

The temporal convolutional layers may be “dilated” temporal convolutional layers, i.e., that generate a layer output by convolving the layer input with a kernel defined in the temporal dimension, where the convolution skips inputs (i.e., along the temporal dimension) according to a step size referred to as the dilation rate. A dilated temporal convolutional layer may be said to process a space-time input at a “temporal resolution” that is defined by the dilation rate, e.g., such that a higher dilation rate may correspond to a lower temporal resolution. In some cases, each temporal convolutional layer in a block may be configured to process space-time inputs at the same temporal resolution, which may be referred to as the temporal resolution of the block, and different blocks may have different temporal resolutions. In the neural network illustrated with reference to FIG. 1, the dilation rate of each block is denoted by r, e.g., where the block 102 has dilation rate r=4. The temporal resolution of a block may influence the nature of the information encoded in the output generated by the block, and blocks having different temporal resolutions may generate block outputs that encode complementary information, which may improve the performance of the neural network.
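
As a minimal sketch, a dilated temporal convolution over a space-time tensor of shape (batch, channels, time, height, width) could be expressed as a 3-D convolution whose kernel and dilation are non-trivial only along the temporal dimension; the use of PyTorch, the kernel size, and the padding scheme are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class DilatedTemporalConv(nn.Module):
    """Temporal convolution with dilation rate r, applied to a 5-D space-time tensor."""

    def __init__(self, in_channels, out_channels, kernel_size=3, dilation_rate=4):
        super().__init__()
        # The kernel extends only along time; padding keeps the temporal length unchanged.
        pad_t = dilation_rate * (kernel_size - 1) // 2
        self.conv = nn.Conv3d(
            in_channels, out_channels,
            kernel_size=(kernel_size, 1, 1),
            dilation=(dilation_rate, 1, 1),
            padding=(pad_t, 0, 0))

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.conv(x)

x = torch.randn(2, 32, 16, 56, 56)
y = DilatedTemporalConv(32, 32, dilation_rate=4)(x)  # output shape: (2, 32, 16, 56, 56)
```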

The spatial convolutional layers may be “strided” convolutional layers, i.e., that generate a layer output by convolving the layer input with a kernel defined in the spatial dimensions, where the convolutional window traverses the input (i.e., along the spatial dimensions) according to a step size referred to as the stride.

In some implementations, the neural network layers in each block may be arranged into “residual modules”. Each residual module includes one or more neural network layers, and the output of the residual module may be the sum of: (i) the input to the first layer of the residual module, and (ii) the output of the last layer of the residual module. Using residual modules may stabilize the training of the neural network, and thereby improve the performance of the trained neural network.
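
A minimal sketch of such a residual module is shown below, again assuming PyTorch and an arbitrary inner layer stack chosen only for illustration; the residual sum requires the inner layers to preserve the input shape.

```python
import torch.nn as nn

class ResidualModule(nn.Module):
    """Residual module: output = (input to the first layer) + (output of the last layer)."""

    def __init__(self, channels):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(),
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)))

    def forward(self, x):
        return x + self.layers(x)
```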

A given block that is configured to receive inputs from other blocks may maintain a respective “connection weight” parameter corresponding to each of these other blocks. The given block may combine the inputs from the other blocks in accordance with the connection weight parameters to generate a combined input, and then process the combined input (e.g., by one or more convolutional neural network layers) to generate a block output. The given block may generate the combined input F^in, e.g., as:

$F^{in} = \sum_{i = 1}^{n} \mathrm{sigmoid}(w_{i}) \cdot F_{i}^{out} \qquad (1)$

where i indexes the n other blocks that provide inputs to the given block, w_i denotes the connection weight parameter corresponding to block i, and F_i^out denotes the input received from block i. Combining the inputs from the other blocks, e.g., in accordance with equation (1), may require the inputs from the other blocks to have the same spatial dimensionality and channel dimensionality. If the inputs from the other blocks have different spatial dimensionalities, the given block may apply pooling operations (e.g., max-pooling operations) to the inputs from the other blocks to cause their spatial dimensionalities to match. If the inputs from the other blocks have different channel dimensionalities, the given block may apply projection operations (e.g., implemented by spatial convolutional layers with 1×1-dimensional kernels) to cause their channel dimensionalities to match.
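
The following sketch illustrates one way equation (1) and the accompanying pooling and projection steps could be realized for a block with several incoming connections; the PyTorch framing, the use of adaptive max-pooling to match spatial and temporal sizes, and the 1×1 projection layers are assumptions of this illustration rather than details fixed by this specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedCombine(nn.Module):
    """Combine the outputs of n other blocks using learned connection weights, as in equation (1)."""

    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        # One scalar connection weight w_i per incoming block.
        self.weights = nn.Parameter(torch.zeros(len(in_channels_list)))
        # 1x1 projections so that every input ends up with the same channel dimensionality.
        self.projections = nn.ModuleList(
            nn.Conv3d(c, out_channels, kernel_size=1) for c in in_channels_list)

    def forward(self, block_outputs):
        # block_outputs: list of tensors of shape (batch, C_i, T_i, H_i, W_i).
        target_t = min(x.shape[2] for x in block_outputs)
        target_hw = (min(x.shape[3] for x in block_outputs), min(x.shape[4] for x in block_outputs))
        combined = 0.0
        for w, proj, x in zip(self.weights, self.projections, block_outputs):
            x = proj(x)                                              # match channel dimensionality
            x = F.adaptive_max_pool3d(x, (target_t,) + target_hw)    # match spatial/temporal sizes
            combined = combined + torch.sigmoid(w) * x               # sigmoid(w_i) * F_i^out
        return combined
```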

In some implementations, each block in the neural network may be associated with a respective level, e.g., the levels 116-124, and each block in a given level only receives inputs from blocks in lower levels. For example, each block in level 3 (120) may receive inputs from blocks in level 1 (116) and level 2 (118), but not level 3 (120), level 4 (122), or level 5 (124).

As part of selecting the architecture of a neural network, the architecture selection system may select: the number of blocks in each level of the architecture, the temporal resolution of each block, the number of channels in the output generated by each block, and which blocks receive inputs from which other blocks. To this end, the architecture selection system performs an automated search through the space of possible architectures to select an architecture that enables the neural network to effectively perform the machine learning task, as will be described in more detail with reference to FIG. 2.

While this specification primarily describes the architecture selection system as selecting a neural network architecture for performing a video processing task, more generally, the architecture selection system can be used to select a neural network architecture for performing any of a variety of neural network tasks. Examples of other neural network tasks include, e.g.: semantic segmentation tasks, image classification tasks, object detection tasks, action selection tasks, natural language processing tasks, and speech recognition tasks. Moreover, while this specification primarily describes the blocks as being space-time convolutional blocks, more generally, the blocks can include neural network layers of any appropriate type. Moreover, while this specification primarily describes selecting a neural network architecture that is configured to process video frames and/or optical flow frames, the architecture selection system can be used to select neural network architectures that process any appropriate type of data, e.g., data including a sequence of point clouds (e.g., generated by a Lidar sensor), and/or data including a sequence of hyperspectral images (e.g., generated by a hyperspectral sensor).

FIG. 2 shows an example architecture selection system 200. The architecture selection system 200 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The system 200 is configured to select a “final” neural network architecture 202 that enables a neural network having the architecture to effectively perform a machine learning task, e.g., a video processing task, as described above. To this end, the system 200 maintains a population (i.e., a set) of possible neural network architectures 204, and updates the population 204 at each of one or more iterations, which will be referred to herein as “evolutionary” iterations. In particular, the system 200 updates the population of architectures 204 to increase a likelihood that the population 204 includes architectures having superior (e.g., higher) performance metrics. The performance metric for an architecture may characterize a performance (e.g., a prediction accuracy) of a neural network having the architecture on the machine learning task. After a final evolutionary iteration, the system 200 may select an architecture from the population 204 as the final architecture 202, as will be described in more detail below.

The system 200 includes the population of possible architectures 204, a selection engine 206, and an architecture generation engine 208.

The population of possible architectures 204 includes multiple possible architectures and is maintained by the system 200 across evolutionary iterations. Each possible architecture may be, e.g., a video processing architecture that includes multiple space-time convolutional blocks, as described with reference to FIG. 1.

Prior to the first evolutionary iteration, the system 200 may initialize the population of architectures 204 by randomly generating a predefined number (e.g., 20, or any other appropriate number) of architectures, and adding the generated architectures to the population 204.

To randomly generate a possible architecture, the system 200 may initialize the architecture with a predefined number of blocks (e.g., 2 blocks, or any other appropriate number of blocks) at each of a predefined number of levels in the architecture (e.g., 5 levels, or any other appropriate number of levels). The architecture of each block (e.g., the number, type, and configuration of the neural network layers within the block) may be randomly selected from a predefined set of possible block architectures. Each possible block architecture may include dilated temporal convolutional neural network layers having a dilation rate from a set of possible dilation rates, e.g., r ∈ {1, 2, 4, 8}. Each possible block architecture may include strided convolutional neural network layers having a stride from a set of possible strides, e.g., s ∈ {1, 2, 4, 8}. Each possible block architecture may be configured to generate a block output having a number of channels from a set of possible channel dimensionalities, e.g., C ∈ {32, 64, 128, 256}.
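
One lightweight way to represent and randomly initialize such an architecture description is sketched below; the dataclass representation and the function and field names are illustrative assumptions, with the candidate sets for r, s, and C taken from the example values above.

```python
import random
from dataclasses import dataclass, field

DILATION_RATES = [1, 2, 4, 8]
STRIDES = [1, 2, 4, 8]
CHANNELS = [32, 64, 128, 256]

@dataclass
class BlockSpec:
    level: int
    dilation_rate: int
    stride: int
    channels: int

@dataclass
class ArchitectureSpec:
    blocks: list = field(default_factory=list)
    # connections contains (i, j) pairs meaning block j receives the output of block i.
    connections: set = field(default_factory=set)

def random_architecture(num_levels=5, blocks_per_level=2):
    arch = ArchitectureSpec()
    for level in range(num_levels):
        for _ in range(blocks_per_level):
            arch.blocks.append(BlockSpec(
                level=level,
                dilation_rate=random.choice(DILATION_RATES),
                stride=random.choice(STRIDES),
                channels=random.choice(CHANNELS)))
    return arch
```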

After initializing the blocks at each level in the randomly generated neural network architecture, the system 200 may apply a random number of “splitting” or “merging” operations to randomly selected blocks in the architecture. Applying a splitting operation to a block may refer to replacing the block by two new blocks, e.g., where each new block may be configured to generate an output with only half as many channels as the original block, but have the same temporal resolution as the original block. Applying a merging operation to a pair of blocks may refer to replacing the pair of blocks by a single new block. The new block may be configured to generate an output with a number of channels equal to the sum of the respective numbers of channels in the outputs of the original blocks being merged. The temporal resolution of the new block may be randomly chosen from the temporal resolutions of the two original blocks.
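
The splitting and merging operations could be expressed as simple transformations of a block description, as in the hedged sketch below; the dictionary representation and the function names are assumptions of this illustration.

```python
import random

def split_block(block):
    """Replace one block by two blocks with half as many channels and the same temporal resolution."""
    child = dict(block, channels=block["channels"] // 2)
    return [dict(child), dict(child)]

def merge_blocks(block_a, block_b):
    """Replace a pair of blocks by one block whose channel count is the sum of the originals'."""
    return {
        "level": block_a["level"],
        "channels": block_a["channels"] + block_b["channels"],
        # Temporal resolution is chosen at random from the two original blocks.
        "dilation_rate": random.choice([block_a["dilation_rate"], block_b["dilation_rate"]]),
    }

block = {"level": 2, "channels": 64, "dilation_rate": 4}
print(split_block(block))
print(merge_blocks(block, {"level": 2, "channels": 32, "dilation_rate": 2}))
```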

As part of randomly generating the possible architecture, the system 200 also determines which blocks should provide inputs to which other blocks, i.e., the system determines which blocks should be “connected”. As used throughout this specification, a first block may be said to be connected to a second block if the second block is configured to receive an input from the first block. The system 200 may determine which blocks should be connected in the randomly generated architecture, e.g., by adding each possible connection to the architecture with a predefined probability, e.g., p=0.5. A connection between blocks is referred to as “possible” only if it specifies that a block at a higher level should receive an input from a block at a lower level.
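
For example, given the level of each block, the set of possible (lower-level to higher-level) connections could be enumerated and sampled as follows; this is only a sketch, and the list and tuple representation is an assumption.

```python
import random

def random_connections(block_levels, p=0.5):
    """block_levels: list where block_levels[i] is the level of block i.

    Returns a set of (i, j) pairs meaning block j receives the output of block i.
    Only connections from a lower level to a higher level are possible.
    """
    connections = set()
    for i, level_i in enumerate(block_levels):
        for j, level_j in enumerate(block_levels):
            if level_i < level_j and random.random() < p:
                connections.add((i, j))
    return connections

print(random_connections([0, 0, 1, 1, 2, 2]))
```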

At each evolutionary iteration, the selection engine 206 selects a “parent” neural network architecture from the population of architectures 204. To select the parent architecture 210, the selection engine 206 may randomly sample a predefined number of “candidate” architectures from the population of architectures 204, and determine a respective performance metric for each candidate architecture.

To determine the performance metric for a candidate architecture, the selection engine 206 may train a neural network having the candidate architecture to perform the machine learning task by training the neural network on a set of training data to determine trained values of the neural network parameters. In particular, the selection engine trains the neural network to determine trained values of the connection weight parameters corresponding to the connections between the blocks in the candidate architecture, as well as determining trained values of other neural network parameters, e.g., the parameters of the neural network layers within each block.

The set of training data may include multiple training examples, where each training example specifies: (i) a training input to the neural network, and (ii) a target output that should be generated by the neural network by processing the training input. For example, each training example may include a training input that specifies a sequence of video frames and/or a corresponding sequence of optical flow frames, and a target classification output, e.g., that indicates an action being performed by a person depicted in the video frames. The selection engine 206 may train the neural network using any appropriate machine learning training technique, e.g., stochastic gradient descent, where gradients of an objective function are backpropagated through the neural network at each of one or more training iterations. The objective function may be, e.g., a cross-entropy objective function, or any other appropriate objective function.
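
A minimal training loop consistent with this description, assuming PyTorch, a candidate model whose connection weight parameters are exposed as ordinary trainable parameters, and a classification task with a cross-entropy objective, might look as follows; the optimizer settings and the data-loader name are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_candidate(model, train_loader, epochs=1, lr=0.1):
    """Train a candidate network; connection weight parameters are trained jointly with the rest."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for video_batch, action_labels in train_loader:
            optimizer.zero_grad()
            logits = model(video_batch)            # (batch, num_actions)
            loss = loss_fn(logits, action_labels)  # cross-entropy objective
            loss.backward()                        # backpropagate gradients of the objective
            optimizer.step()
    return model
```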

It will be appreciated that the neural network may be trained for video processing tasks other than classification tasks by a suitable selection of training data and/or loss function. For example, the neural network can be trained for super-resolution (in the spatial and/or temporal domain) using a training set comprising down-sampled videos and corresponding higher-resolution ground-truth videos, with a loss function that compares the output of the neural network to the higher-resolution ground-truth video corresponding to the down-sampled video input to the neural network, e.g., an L1 or L2 loss. As a further example, the neural network can be trained to remove one or more types of image/video artefact from videos, such as blocking artefacts that may be introduced during video encoding. In this example, the training dataset may comprise a set of ground-truth videos, each with one or more corresponding “degraded” videos (i.e., with one or more types of artefact introduced), with a loss function that compares the output of the neural network to the ground-truth video corresponding to the degraded video input to the neural network, e.g., an L1 or L2 loss.

After training a neural network having the candidate architecture on the set of training data, the selection engine 206 may determine the performance metric for the candidate architecture by evaluating the performance of the trained neural network on a set of validation data. The validation data may also include multiple training examples, as described above, but is generally separate from the training data, i.e., such that the neural network is not trained on the validation data. The selection engine 206 may evaluate the performance of the trained neural network on the set of validation data using an objective function that measures the prediction accuracy of the trained neural network, e.g., a cross-entropy objective function. The selection engine 206 may determine the performance metric for the candidate architecture, e.g., as an average of the value of the objective function across the set of validation data when the training inputs of the validation data are processed using the trained neural network having the candidate architecture.

The selection engine 206 may identify the parent architecture 210 for the current evolutionary iteration based on the performance metrics for the candidate architectures selected from the population of architectures 204 at the current evolutionary iteration. For example, the selection engine 206 may identify the candidate architecture having the “best” (e.g., highest) performance metric as the parent architecture 210 for the current evolutionary iteration. Moreover, the selection engine 206 may also remove one or more of the candidate architectures from the population of architectures 204. For example, the selection engine 206 may remove the candidate architecture having the “worst” (e.g., lowest) performance metric from the population of architectures 204.
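
This amounts to a tournament-style selection step over a randomly sampled subset of the population, sketched below under the assumptions that the population is held as a list of (architecture, performance_metric) pairs, that higher metrics are better, and that the worst sampled candidate is removed in place; the function name and tournament size are illustrative.

```python
import random

def select_parent_and_cull(population, tournament_size=5):
    """population: list of (architecture, performance_metric) pairs, higher metric is better.

    Returns the parent architecture and removes the worst sampled candidate from the population.
    """
    candidates = random.sample(population, min(tournament_size, len(population)))
    best = max(candidates, key=lambda pair: pair[1])
    worst = min(candidates, key=lambda pair: pair[1])
    population.remove(worst)
    return best[0]
```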

In some cases, some or all of the candidate architectures may have been trained to perform the machine learning task at previous evolutionary iterations, and the architecture selection system 200 may have stored the previously generated performance metrics for these architectures. In these cases, the selection engine 206 may reuse the previously generated performance metrics for these architectures, rather than generating them again at each evolutionary iteration.

At each evolutionary iteration, the architecture generation engine 208 is configured to generate a “new” neural network architecture based on both: (i) the parent architecture 210, and (ii) the trained values of the connection weight parameters 212 of the parent architecture 210. In particular, the architecture generation engine 208 determines which blocks in the new architecture should be connected (i.e., which blocks should receive inputs from which other blocks) based at least in part on the trained values of the connection weight parameters 212 of the parent architecture 210.

For example, the architecture generation engine 208 may initialize the new architecture 214 by generating a copy of the parent architecture 210 and setting the new architecture 214 equal to the copy of the parent architecture 210. The architecture generation engine 208 may then modify the connections between the blocks of the new architecture 214 based on the connection weight parameter values 212 of the parent architecture 210. For example, the architecture generation engine 208 may prune (remove) each connection from the new architecture 214 that corresponds to a connection weight parameter value 212 that is below a threshold value. That is, the architecture generation engine 208 may maintain only the connections in the new architecture 214 that correspond to connection weight parameter values 212 that exceed a threshold value, while removing the other connections. The threshold value may be a predefined threshold value (e.g., 0.5), or a threshold value that is dynamically determined for each connection by sampling from a predefined probability distribution, e.g., a uniform probability distribution over the interval [0,1].
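
A sketch of this pruning step is given below; it assumes the connections are stored as a set of (i, j) pairs, that the trained connection weights (e.g., after the sigmoid of equation (1)) lie in [0, 1] and are stored in a dict keyed by the same pairs, and that a per-connection threshold is drawn uniformly from [0, 1] when no fixed threshold is supplied.

```python
import random

def prune_connections(connections, trained_weights, threshold=None):
    """Keep only connections whose trained weight exceeds a (fixed or sampled) threshold.

    connections: set of (i, j) pairs; trained_weights: dict mapping (i, j) -> weight in [0, 1].
    """
    kept = set()
    for conn in connections:
        t = threshold if threshold is not None else random.uniform(0.0, 1.0)
        if trained_weights[conn] > t:
            kept.add(conn)
    return kept
```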

The architecture generation engine 208 may also add new connections to the new architecture 214. For example, for each possible connection that is not included in the parent architecture 210, the architecture generation engine 208 may add the possible connection to the new architecture with a predefined probability. The predefined probability may be a ratio of: (i) the number of connections that were pruned from the new architecture, and (ii) the number of possible connections that were not included in the parent architecture 210. Adding new connections to the new architecture 214 with this probability may result in the new architecture 214 having, on average, the same total number of connections as the parent architecture 210.
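
Continuing the same illustrative representation, the re-addition step could look like the following, where possible_connections is the set of all lower-level to higher-level pairs for the architecture; the function name and arguments are assumptions of this sketch.

```python
import random

def add_random_connections(kept, parent_connections, possible_connections):
    """Add unused possible connections with probability num_pruned / num_unused."""
    unused = possible_connections - parent_connections
    num_pruned = len(parent_connections) - len(kept)
    if not unused or num_pruned <= 0:
        return set(kept)
    p = num_pruned / len(unused)
    new_connections = set(kept)
    for conn in unused:
        if random.random() < p:
            new_connections.add(conn)
    return new_connections
```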

After modifying the connections between the blocks of the new architecture 214, the architecture generation engine 208 may apply one or more “mutation” operations to randomly selected blocks of the new architecture 214. The mutation operations may include, e.g.: block splitting operations, block merging operations, and operations to change the dilation rate of temporal convolutional layers in blocks. In applying a splitting operation to a block to generate two new blocks, the architecture generation engine 208 may determine that each new block has the same connections as the original block. That is, each new block may be configured to receive inputs from the same source(s) as the original block, and provide its output to the same destination(s) as the original block. In applying a merging operation to a pair of blocks, the architecture generation engine 208 may determine that the new block has all the connections of both the original blocks. That is, the new block may be configured to receive inputs from the same source(s) as each original block, and provide its output to the same destination(s) as each original block. The architecture generation engine 208 may change the dilation rate of the temporal convolutional layers in a block, e.g., by randomly sampling a new dilation rate for the temporal convolutional layers in the block from a set of possible temporal dilation rates.

The system 200 may add the new architecture 214 generated by the architecture generation engine 208 to the population of architectures 204, and then proceed to the next evolutionary iteration.

Generating the new architectures 214 based on the learned connection weight parameter values 212 of the parent architectures 210 may enable the system 200 to generate new architectures 214 that maintain some of the properties which make the parent architecture 210 effective at performing the machine learning task. Simultaneously, new connections are randomly added to the new architectures 214, thereby enabling the system 200 to explore variations of the parent architectures 210 that may be more effective at performing the machine learning task. Guiding the evolution of the population of architectures based on trained values of connection weight parameters may increase the likelihood of the system generating new architectures 214 having superior (e.g., higher) performance metrics.

After a final evolutionary iteration, the system 200 may select one or more of the architectures from the population of architectures 204 as final architectures 202. For example, the system 200 may identify the final architecture 202 as being the architecture with the best (e.g., highest) performance metric from among the population of architectures 204. As another example, the system 200 may identify a predefined number of architectures with the best performance metrics from among the population of architectures 204 as the final architectures.

The final architecture(s) 202 can be used in any of a variety of ways. For example, a neural network having the final architecture 202 may be deployed to perform the machine learning task. As another example, the final architecture(s) 202 may be used to initialize a subsequent architecture refinement procedure, e.g., that modifies the final architecture(s) 202 to determine other architecture(s) for performing the machine learning task. As another example, multiple final architectures may be combined to form an ensemble model for performing the machine learning task, e.g., such that the output of each final architecture is combined (e.g., averaged) to determine an ensemble output.

FIG. 3 is a flow diagram of an example process 300 for determining a neural network architecture of a neural network for performing a video processing neural network task. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, an architecture selection system, e.g., the architecture selection system 200 of FIG. 2, appropriately programmed in accordance with this specification, can perform the process 300.

The system maintains data defining a set of neural network architectures (302). Each neural network architecture includes a plurality of blocks, where each block is a space-time convolutional block including one or more neural network layers that is configured to process a block input to generate a block output. For each of one or more given blocks: (i) the block input for the given block includes a block output from each of one or more other blocks, (ii) the given block has a respective connection weight parameter corresponding to each of the one or more other blocks, and (iii) processing the block input includes combining the other block outputs using the connection weight parameters corresponding to the other blocks.

The system performs steps 304-310 at each of one or more evolutionary iterations.

The system selects a parent neural network architecture from the set of neural network architectures (304).

The system trains a neural network having the parent neural network architecture to perform the video processing neural network task, including determining trained values of the connection weight parameters of the parent neural network architecture (306).

The system generates a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture (308).

The system adds the new neural network architecture to the set of neural network architectures (310).

If the current evolutionary iteration is not the final evolutionary iteration, the system returns to step 304. If the current evolutionary iteration is the final evolutionary iteration, the system selects a final neural network architecture from the set of neural network architectures based on a performance metric for the final neural network architecture on the video processing neural network task (312).

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework, a Microsoft Cognitive Toolkit framework, an Apache Singa framework, or an Apache MXNet framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A method performed by one or more data processing apparatus for determining a neural network architecture of a neural network for performing a video processing neural network task, the method comprising: maintaining data defining a set of neural network architectures, wherein for each neural network architecture: the neural network architecture comprises a plurality of blocks, wherein each block is a space-time convolutional block comprising one or more neural network layers that is configured to process a block input to generate a block output; and for each of one or more given blocks: (i) the block input for the given block comprises a block output from each of a plurality of other blocks, (ii) the given block has a respective connection weight parameter corresponding to each of the plurality of other blocks, and (iii) processing the block input comprises combining the other block outputs using the connection weight parameters corresponding to the other blocks; at each of a plurality of iterations: selecting a parent neural network architecture from the set of neural network architectures; training a neural network having the parent neural network architecture to perform the video processing neural network task, comprising determining trained values of the connection weight parameters of the parent neural network architecture; generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture; and adding the new neural network architecture to the set of neural network architectures; and after a final iteration of the plurality of iterations, selecting a final neural network architecture from the set of neural network architectures based on a performance metric for the final neural network architecture on the video processing neural network task.
 2. The method of claim 1, wherein each neural network architecture is configured to process an input comprising (i) a plurality of video frames, and/or (ii) a plurality of optical flow frames corresponding to the plurality of video frames.
 3. The method of claim 1 wherein each block processes a block input at a respective temporal resolution to generate a block output having a respective number of channels.
 4. The method of claim 3, wherein each block comprises one or more dilated temporal convolutional layers having a temporal dilation rate corresponding to the temporal resolution of the block.
 5. The method of claim 3, wherein each neural network architecture comprises blocks having different temporal resolutions.
 6. The method of claim 1, wherein each block comprises one or more residual modules.
 7. The method of claim 1, wherein combining the other block outputs using the connection weight parameters corresponding to the other blocks comprises: for each other block output, scaling the other block output by the connection weight parameter corresponding to the other block; and generating a combined input by summing the scaled other block outputs.
 8. The method of claim 7, wherein processing the block input further comprises: processing the combined input in accordance with a plurality of block parameters to generate the block output.
 9. The method of claim 1, wherein generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture comprises: determining which blocks in the new neural network architecture should receive block outputs from which other blocks in the new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture.
 10. The method of claim 9, wherein determining which blocks in the new neural network architecture should receive block outputs from which other blocks in the new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture comprises: determining, for each given block in the parent neural network architecture that (i) receives a block output from an other block and (ii) has a connection weight parameter corresponding to the other block having a trained value that exceeds a threshold, that a block in the new neural network architecture corresponding to the given block should receive a block output from a block in the new neural network architecture corresponding to the other block.
 11. The method of claim 10, wherein the threshold is a predetermined threshold.
 12. The method of claim 10, wherein the threshold is sampled in accordance with a predetermined probability distribution.
 13. The method of claim 9, further comprising: for each of one or more pairs of blocks in the new neural network architecture comprising a first block and a second block, randomly determining whether the second block should receive a block output from the first block.
 14. The method of claim 1, wherein generating the new neural network architecture comprises: for each of one or more blocks in the new neural network architecture that correspond to respective blocks in the parent neural network architecture, applying one or more mutation operations to the block, wherein the mutation operations comprise: splitting the block, merging the block with a different block, and adjusting a temporal resolution of the block.
 15. The method of claim 1, wherein selecting a parent neural network architecture from the set of neural network architectures comprises: determining, for each of a plurality of particular neural network architectures from the set of neural network architectures, a performance metric of a neural network having the particular neural network architecture that is trained to perform the video processing neural network task; and selecting the parent neural network architecture from among the plurality of particular neural network architectures based on the performance metrics.
 16. The method of claim 15, wherein selecting the parent neural network architecture from among the plurality of particular neural network architectures based on the performance metrics comprises: selecting the parent neural network architecture as the particular neural network architecture having the highest performance metric on the video processing neural network task.
 17. The method of claim 15, further comprising: removing the particular neural network architecture having the lowest performance metric on the video processing neural network task from the set of neural network architectures.
 18. The method of claim 1, wherein selecting the final neural network architecture from the set of neural network architectures comprises: determining, for each neural network architecture from the set of neural network architectures, a performance metric of a neural network having the neural network architecture that is trained to perform the video processing neural network task; and selecting the final neural network architecture as the neural network architecture having the highest performance metric on the video processing neural network task.
 19. (canceled)
 20. (canceled)
 21. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations for determining a neural network architecture of a neural network for performing a video processing neural network task, the operations comprising:
    maintaining data defining a set of neural network architectures, wherein for each neural network architecture:
        the neural network architecture comprises a plurality of blocks, wherein each block is a space-time convolutional block comprising one or more neural network layers that is configured to process a block input to generate a block output; and
        for each of one or more given blocks: (i) the block input for the given block comprises a block output from each of a plurality of other blocks, (ii) the given block has a respective connection weight parameter corresponding to each of the plurality of other blocks, and (iii) processing the block input comprises combining the other block outputs using the connection weight parameters corresponding to the other blocks;
    at each of a plurality of iterations:
        selecting a parent neural network architecture from the set of neural network architectures;
        training a neural network having the parent neural network architecture to perform the video processing neural network task, comprising determining trained values of the connection weight parameters of the parent neural network architecture;
        generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture; and
        adding the new neural network architecture to the set of neural network architectures; and
    after a final iteration of the plurality of iterations, selecting a final neural network architecture from the set of neural network architectures based on a performance metric for the final neural network architecture on the video processing neural network task.
 22. One or more computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for determining a neural network architecture of a neural network for performing a video processing neural network task, the operations comprising:
    maintaining data defining a set of neural network architectures, wherein for each neural network architecture:
        the neural network architecture comprises a plurality of blocks, wherein each block is a space-time convolutional block comprising one or more neural network layers that is configured to process a block input to generate a block output; and
        for each of one or more given blocks: (i) the block input for the given block comprises a block output from each of a plurality of other blocks, (ii) the given block has a respective connection weight parameter corresponding to each of the plurality of other blocks, and (iii) processing the block input comprises combining the other block outputs using the connection weight parameters corresponding to the other blocks;
    at each of a plurality of iterations:
        selecting a parent neural network architecture from the set of neural network architectures;
        training a neural network having the parent neural network architecture to perform the video processing neural network task, comprising determining trained values of the connection weight parameters of the parent neural network architecture;
        generating a new neural network architecture based at least in part on the trained values of the connection weight parameters of the parent neural network architecture; and
        adding the new neural network architecture to the set of neural network architectures; and
    after a final iteration of the plurality of iterations, selecting a final neural network architecture from the set of neural network architectures based on a performance metric for the final neural network architecture on the video processing neural network task.
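
The following illustrative Python sketches are provided for exposition only. Each is a simplified, non-limiting rendering of operations recited in the claims above, under assumptions stated in the accompanying text and comments; none defines or limits the claimed subject matter.

A minimal sketch of a space-time convolutional block having a dilated temporal convolution and a residual connection, in the spirit of claims 3 through 6. The use of PyTorch, the channel counts, the kernel sizes, and the single-residual-module structure are assumptions made for illustration.

    import torch
    from torch import nn

    class SpaceTimeBlock(nn.Module):
        """Illustrative space-time convolutional block (claims 3-6)."""

        def __init__(self, channels: int, temporal_dilation: int):
            super().__init__()
            # Dilated temporal convolution whose dilation rate corresponds to
            # the block's temporal resolution (claim 4); the padding preserves
            # the temporal length.
            self.temporal = nn.Conv3d(
                channels, channels, kernel_size=(3, 1, 1),
                padding=(temporal_dilation, 0, 0),
                dilation=(temporal_dilation, 1, 1))
            # Spatial convolution over height and width.
            self.spatial = nn.Conv3d(
                channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
            self.relu = nn.ReLU()

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x has shape (batch, channels, time, height, width); adding the
            # input back corresponds to the residual module of claim 6.
            out = self.relu(self.temporal(x))
            out = self.spatial(out)
            return self.relu(out + x)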
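
A minimal sketch of combining other-block outputs using connection weight parameters, as recited in claims 7 and 8: each other-block output is scaled by its corresponding connection weight, and the scaled outputs are summed. The tensor representation and the use of PyTorch are assumptions.

    import torch

    def combine_block_inputs(other_block_outputs, connection_weights):
        """Scale each other-block output by its connection weight parameter
        and sum the scaled outputs to form the combined input (claim 7)."""
        combined = torch.zeros_like(other_block_outputs[0])
        for output, weight in zip(other_block_outputs, connection_weights):
            combined = combined + weight * output
        # Per claim 8, the combined input would then be processed in
        # accordance with the block parameters (e.g., by a block such as
        # SpaceTimeBlock above) to generate the block output.
        return combined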
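
A minimal sketch of determining the connectivity of a new architecture from the trained connection weight values of the parent, in the spirit of claims 9 through 13. The dictionary representation of edges, the uniform distribution used for the sampled threshold, and the probability used for randomly adding connections are assumptions.

    import random

    def derive_child_connections(trained_weights, threshold=None,
                                 random_edge_prob=0.05):
        """Decide which blocks of the new architecture receive block outputs
        from which other blocks (claims 9-13).

        `trained_weights` maps (source_block, destination_block) edges of the
        parent architecture to the trained values of the corresponding
        connection weight parameters.
        """
        # The threshold may be predetermined (claim 11) or sampled from a
        # probability distribution (claim 12); a uniform draw is assumed here.
        if threshold is None:
            threshold = random.uniform(0.1, 0.5)

        # Keep an edge in the new architecture whenever the parent's trained
        # connection weight for that edge exceeds the threshold (claim 10).
        child_edges = {edge for edge, weight in trained_weights.items()
                       if weight > threshold}

        # Randomly determine whether additional pairs of blocks should be
        # connected (claim 13).
        for edge in trained_weights:
            if edge not in child_edges and random.random() < random_edge_prob:
                child_edges.add(edge)
        return child_edges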
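
A minimal sketch of the overall iterative loop of claim 1, with parent selection based on performance metrics (claims 15 and 16), removal of the lowest-performing architecture (claim 17), and selection of the final architecture (claim 18). The tournament size, the population representation, and the `train_and_evaluate` and `mutate` callables (the latter of which could apply operations such as those of claim 14 and the connectivity sketch above) are assumptions made for the sketch.

    import copy
    import random

    def evolve_architectures(population, train_and_evaluate, mutate,
                             num_iterations, tournament_size=5):
        """Evolve a set of architectures (claim 1).

        `population` is a list of (architecture, performance_metric) pairs;
        `train_and_evaluate(architecture)` trains a network having that
        architecture and returns (trained_connection_weights,
        performance_metric); `mutate(architecture, trained_connection_weights)`
        builds a new architecture guided by the trained connection weights.
        """
        for _ in range(num_iterations):
            # Sample several architectures and select the best-performing one
            # as the parent (claims 15 and 16).
            candidates = random.sample(population,
                                       min(tournament_size, len(population)))
            parent, _ = max(candidates, key=lambda item: item[1])

            # Train the parent to obtain trained connection weight values,
            # then generate and evaluate a new (child) architecture.
            trained_weights, _ = train_and_evaluate(parent)
            child = mutate(copy.deepcopy(parent), trained_weights)
            _, child_metric = train_and_evaluate(child)
            population.append((child, child_metric))

            # Remove the architecture with the lowest performance metric
            # (claim 17).
            population.remove(min(population, key=lambda item: item[1]))

        # After the final iteration, select the architecture with the highest
        # performance metric (claim 18).
        return max(population, key=lambda item: item[1])[0]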